Matching Patients to Clinical Trials with Large Language Models

Clinical trials are often hindered by the challenge of patient recruitment. In this work, we introduce TrialGPT, a first-of-its-kind large language model (LLM) framework to assist patient-to-trial matching. Given a patient note, TrialGPT predicts the patient’s eligibility on a criterion-by-criterion basis and then consolidates these predictions to assess the patient’s eligibility for the target trial. We evaluate the trial-level prediction performance of TrialGPT on three publicly available cohorts of 184 patients with over 18,000 trial annotations. We also engaged three physicians to label over 1,000 patient-criterion pairs to assess its criterion-level prediction accuracy. Experimental results show that TrialGPT achieves a criterion-level accuracy of 87.3% with faithful explanations, close to the expert performance (88.7%–90.0%). The aggregated TrialGPT scores are highly correlated with human eligibility judgments, and they outperform the best-competing models by 32.6% to 57.2% in ranking and excluding clinical trials. Furthermore, our user study reveals that TrialGPT can significantly reduce the screening time (by 42.6%) in a real-life clinical trial matching task. These results and analyses have demonstrated promising opportunities for clinical trial matching with LLMs such as TrialGPT.


Introduction
Clinical trials examine the effectiveness of medical interventions and provide crucial evidence that can be used to guide clinical practice. They also offer an opportunity for participants to receive experimental treatments that could potentially improve their health outcomes. However, matching patients to suitable clinical trials can be a challenging process [1][2][3]. This process includes analyzing a patient's medical history, understanding the eligibility criteria of each clinical trial, and ensuring a match that satisfies both patient needs and trial requirements. As such, manually matching patients and clinical trials is often labor-intensive, time-consuming, and prone to human errors.
Recently, artificial intelligence (AI) has shown promise in improving the efficiency and accuracy of patient-trial matching 4. Based on the directionality, there are two types of patient-trial matching tasks. The "trial-to-patient" scheme matches one trial to a list of candidate patients, which is a common need for clinical trial organizers and can be done by converting the trial criteria to structured query languages and searching the patient database [5][6][7]. On the other hand, the "patient-to-trial" scheme matches one patient to a list of candidate clinical trials [8][9][10][11]. In this study, we focus on the patient-centric "patient-to-trial" scheme because such a model can empower individual patients as well as referral offices to explore a large set of potentially eligible clinical trials. However, the heterogeneity and ambiguity inherent in patient records and clinical trial criteria pose significant challenges for AI algorithms. Prior efforts encoded patient records and trial criteria into dense embeddings using neural networks, aiming to represent them in the same embedding space so that patient-trial matching can be performed through similarity search [12][13][14]. Nonetheless, training neural networks to understand ambiguous criteria texts and diverse patient records requires large datasets, which is often infeasible due to the lack of paired patient-criterion matching annotations. Besides, the black-box dense retrieval process is not explainable, and is thus hard to debug when extrapolating to previously unseen criteria and patients.
In this work, we aim to evaluate how recent large language models (LLMs) such as GPT-4 15 can aid the process of patient-to-trial matching in a data-efficient and transparent way. LLMs are transformer-based models 16 that can understand a given context and generate human-like responses accordingly. They have shown state-of-the-art capabilities in both the general domain 15,17 and biomedicine 18, including question answering [19][20][21][22][23][24] and clinical trial design 25,26.
Several pilot studies have also explored using LLMs to enhance the first-stage retrieval of clinical trials through information extraction 27, to perform data augmentation with clinical trial criteria 28, and to structure clinical trial criteria 29. Our study instead focuses on rationalizing criterion-level predictions and fine-grained patient-to-trial ranking with LLMs.
We propose TrialGPT, a novel LLM framework for patient-to-trial matching. As shown in Figure 1, TrialGPT handles two challenging tasks: (a) predicting patient eligibility for each clinical trial criterion with explanations, and (b) aggregating the criterion-level predictions into a trial-level score for downstream applications. Specifically, given a patient note and a candidate clinical trial, TrialGPT predicts three elements for each eligibility criterion: (1) a natural language explanation of the patient's relevance to the criterion, (2) the locations of sentences in the patient note that are relevant to the criterion, and (3) an eligibility classification indicating whether the patient meets the criterion. Three domain experts evaluated TrialGPT on 1,015 patient-criterion pairs, and the results show that TrialGPT can accurately explain patient-criterion relevance, locate relevant sentences, and predict criterion-level eligibility with an accuracy close to that of human experts. We then evaluate the trial-level scores of TrialGPT on three publicly available cohorts of 184 patients and 18,238 manually annotated clinical trials. Experimental results show that the aggregated TrialGPT scores are highly correlated with expert eligibility annotations. They can be used to effectively match eligible trials with patients and exclude ineligible trials, with performance 11.3% to 27.4% higher than that of the best baselines.
We also conducted a pilot user study that mimics the actual clinical trial matching task at the National Cancer Institute (NCI). In the evaluation, each patient-trial pair is assessed by one medical expert with TrialGPT and by another without TrialGPT. We also ensure that each medical expert annotates half of the pairs with TrialGPT and half without, to mitigate skill differences between the annotators when computing the time reduction. The overall time saving across all patient-trial pairs is about 42.6% (from 61.5 seconds to 31.3 seconds), which shows the potential to greatly enhance the efficiency of the clinical trial matching process.

Cohort characteristics
To evaluate TrialGPT, we use the patient notes and clinical trials from three publicly available cohorts: a test collection for patient-trial matching published by the Special Interest Group on Information Retrieval (SIGIR) in 2016 9, and the 2021 and 2022 Clinical Trials (CT) tracks 8 of the Text REtrieval Conference (TREC). For each patient, we sample at most 50 clinical trials for each eligibility category: the TREC CT cohorts have "eligible", "excluded" ("ineligible"), and "irrelevant" trials, while the SIGIR cohort does not have "excluded" trials. The baseline statistics of these patient cohorts are shown in Table 1. We use the combination of these three cohorts as the final evaluation corpus. To evaluate criterion-level prediction accuracy, three physicians were recruited to manually annotate 1,015 patient-criterion pairs with respect to the three output elements shown in Figure 1a: (1) the correctness of the TrialGPT relevance explanation for the given patient and criterion, (2) the relevant sentence locations in the patient note, and (3) the criterion-level prediction of the patient's eligibility. Consensus annotations, derived from individual annotations and further discussion, are used as the ground truth.

TrialGPT achieves a high criterion-level prediction accuracy
As shown in Figure 1a, TrialGPT first generates the rationales and the relevant sentences for each criterion. Then, it predicts the criterion-level eligibility classification based on the rationales. TrialGPT assigns each inclusion criterion a label within {Included, Not included, Not enough information, Not applicable} and each exclusion criterion a label within {Excluded, Not excluded, Not enough information, Not applicable}. The label definitions can be found in the Online Methods.
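For reference, the two label spaces can be written down as simple Python constants (a notational sketch; the names are ours, not from an official TrialGPT codebase):

```python
# Criterion-level label spaces used by TrialGPT (names are illustrative).
INCLUSION_LABELS = {
    "included",                # the patient meets the inclusion criterion
    "not included",            # the patient does not meet it
    "not enough information",  # the note lacks the needed information
    "not applicable",          # the criterion does not apply to this patient
}

EXCLUSION_LABELS = {
    "excluded",                # the patient meets the exclusion criterion
    "not excluded",            # the patient does not meet it
    "not enough information",
    "not applicable",
}
```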

Evaluating relevance explanations
We show the percentage of "correct", "partially correct", and "incorrect" TrialGPT explanations in Figure 2a. Overall, most explanations are "correct" (87.8%) by manual evaluation, while fewer than 10% are "partially correct" (9.66%), and only a small proportion are "incorrect" (2.56%). We also found that most of the incorrect explanations are for criteria labeled as "not included" or "not excluded", which usually require implicit inference. TrialGPT makes far fewer mistakes when the criteria are explicitly "included" or "excluded". These results suggest that TrialGPT can effectively explain how a patient is relevant to an eligibility criterion.

Evaluating relevant sentence locations
We further compare the relevant sentences predicted by TrialGPT against the ground-truth expert annotations of relevant sentence locations. As shown in Figure 2b, the TrialGPT-predicted sentence locations are 90.1% correct (precision) and cover 87.9% of the ground-truth relevant sentence IDs (recall), leading to an F1 score of 88.6%. The performance of TrialGPT is close to that of human experts, which ranges from 86.9% to 91.5%. This shows that TrialGPT can faithfully locate relevant sentences in patient notes, which further improves its explainability.
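To make these metrics concrete, here is a minimal sketch of how precision, recall, and F1 can be computed for one criterion, assuming predicted and annotated sentence IDs are compared as sets (the averaging scheme across criteria is not specified in the text):

```python
def sentence_location_scores(pred_ids, gold_ids):
    """Precision/recall/F1 of predicted relevant-sentence IDs for one criterion.

    `pred_ids` and `gold_ids` are collections of sentence indices.
    A minimal sketch, not the paper's exact evaluation code.
    """
    pred, gold = set(pred_ids), set(gold_ids)
    tp = len(pred & gold)  # correctly located sentences
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: predicted sentences {0, 2, 5} vs. expert-annotated {0, 5}.
print(sentence_location_scores([0, 2, 5], [0, 5]))  # (0.667, 1.0, 0.8)
```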

Evaluating eligibility prediction
Finally, we evaluate the criterion-level eligibility labels predicted by TrialGPT and by individual human annotators against the ground-truth annotations. Figures 2c and 2d show the confusion matrices for these predictions. For the inclusion criteria, TrialGPT reaches a prediction accuracy of 0.899 over all four labels, which is within the experts' accuracy range of 0.876 to 0.916. For the exclusion criteria, while the accuracy is high for criteria labeled "excluded" (1.00) and "not applicable" (0.98), TrialGPT tends to confuse "not excluded", "not enough information", and "not applicable". Overall, TrialGPT achieves an accuracy of 0.859 on the exclusion criteria. These results suggest that TrialGPT can accurately predict patient eligibility at the criterion level, with a performance close to that of human experts.

TrialGPT mostly makes reasoning-related errors
We further inspected the 26 criterion-level predictions that were labeled as "Incorrect" by annotator consensus. Four types of errors were identified: (E1) incorrect reasoning, where TrialGPT predicts "not enough information" although the matching result can be implicitly inferred; (E2) lack of medical knowledge, such as not knowing that "A" is "B" or that "A" is a type of "B"; (E3) ambiguous label definitions, where the model confuses two annotations, e.g., "not enough information" and "not applicable"; and (E4) other unclassified errors.
Supplementary Table 1 shows the proportion and an example of each error type made by TrialGPT. The largest share of the errors (30.7%) is due to incorrect reasoning, followed by ambiguous or redundant definitions of the eligibility classes (26.9%). Lack of medical knowledge contributes about 15.4% of the total errors. These results suggest that improving the medical capabilities of the backbone LLM is an important future direction.

Aggregated TrialGPT scores correlate with trial-level eligibility
TrialGPT has been shown to achieve high prediction accuracy at the criterion level. However, since one clinical trial typically has many inclusion and exclusion criteria, trial-level scores must be computed to decide the extent to which a given patient is eligible or ineligible. In this section, we evaluate how the criterion-level predictions of TrialGPT can be aggregated into trial-level scores (the component shown in Figure 1b). For this, we analyze the correlations between patient-trial eligibility and eight trial-level scores, which are computed by two types of methods: linear aggregation and LLM aggregation. The results are presented as box plots in Figure 3.

Linear aggregations
Six scores are computed from the percentages of the criterion-level eligibility predictions of TrialGPT. Their correlations with trial-level eligibility labels are shown in Figure 3a-f. Figure 3a shows the percentage of inclusion criteria predicted as "included" by TrialGPT. As expected, patients meet the highest percentage of inclusion criteria in eligible clinical trials and the lowest percentage in irrelevant clinical trials; relevant but ineligible trials fall in between. Figure 3b shows the percentage of inclusion criteria predicted as "not included", which follows the reverse trend of Figure 3a. Notably, no inclusion criterion is classified by TrialGPT as "not included" in eligible patient-trial pairs, confirming the correctness of the model. Figure 3d shows the percentage of exclusion criteria predicted as "excluded". Interestingly, patients meet more exclusion criteria in ineligible clinical trials than in irrelevant and eligible clinical trials, unlike the other scores, which are either monotonically increasing or decreasing along the irrelevant-ineligible-eligible order. This is a characteristic feature of patient-trial pairs that are explicitly excluded and can be exploited in patient-trial matching.
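A minimal sketch of how such percentage-based scores can be computed from the criterion-level labels; the exact set of six percentages shown in Figure 3 is our reading of the panels described above:

```python
def linear_scores(inclusion_preds, exclusion_preds):
    """Six percentage-based trial-level scores (a sketch).

    `inclusion_preds` / `exclusion_preds`: lists of criterion-level labels
    predicted by TrialGPT for one patient-trial pair.
    """
    def pct(preds, label):
        return preds.count(label) / len(preds) if preds else 0.0

    return {
        "pct_included":     pct(inclusion_preds, "included"),
        "pct_not_included": pct(inclusion_preds, "not included"),
        "pct_no_info_inc":  pct(inclusion_preds, "not enough information"),
        "pct_excluded":     pct(exclusion_preds, "excluded"),
        "pct_not_excluded": pct(exclusion_preds, "not excluded"),
        "pct_no_info_exc":  pct(exclusion_preds, "not enough information"),
    }
```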

LLM aggregations
We also use LLMs to further aggregate the criterion-level predictions of TrialGPT, resulting in two scores. The general relevance score (0 to 100) is shown in Figure 3g, where the irrelevant patient-trial pairs score much lower than the other two groups. The eligible and ineligible patient-trial groups overlap to some extent, but the former still scores significantly higher than the latter.
The eligibility score (-100 to 100) is shown in Figure 3h, where negative scores denote ineligible, positive scores denote eligible, and a score of 0 denotes neutral. Overall, the eligible patient-trial pairs have the highest scores, and the ineligible patient-trial pairs have the lowest scores.

In summary, criterion-level TrialGPT predictions can be aggregated into trial-level scores that are highly correlated with patient-trial eligibility. The results of linear aggregation demonstrate that eligible patient-trial pairs have the highest proportions of met inclusion criteria and unmet exclusion criteria, while ineligible patient-trial pairs have the highest proportions of met exclusion criteria. In addition, the LLM aggregations are also significantly correlated with the manual eligibility labels. These results suggest that the aggregated TrialGPT scores can be used to rank or exclude a list of candidate clinical trials for a given patient.
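A minimal sketch of the LLM aggregation step, assuming a generic `chat` helper that wraps the backbone LLM; the actual aggregation prompt is given in the paper's supplementary materials, so the wording below is illustrative only:

```python
import json

AGGREGATION_PROMPT = """\
Given the criterion-by-criterion eligibility analysis below, output JSON with:
- "relevance_score": 0-100, how relevant the trial is to the patient;
- "eligibility_score": -100 (fully ineligible) to 100 (fully eligible).
Analysis:
{analysis}
"""

def llm_aggregate(chat, criterion_predictions):
    """Turn criterion-level predictions into the two trial-level LLM scores.

    `chat` is any callable that sends a prompt to the backbone LLM and
    returns its text response. A sketch, not the paper's exact prompt.
    """
    prompt = AGGREGATION_PROMPT.format(
        analysis=json.dumps(criterion_predictions, indent=2))
    scores = json.loads(chat(prompt))
    return scores["relevance_score"], scores["eligibility_score"]
```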

TrialGPT can effectively rank and exclude candidate clinical trials
In this section, we evaluate TrialGPT on ranking a list of candidate clinical trials and excluding ineligible clinical trials for a given patient (the component shown in Figure 1c). Based on the correlation analysis, we design a suite of scoring methods that aggregate the criterion-level predictions of TrialGPT into a trial-level score for ranking the candidate trials. Table 2 shows the Normalized Discounted Cumulative Gain at rank 10 (NDCG@10), Precision at rank 10 (P@10), and Area Under the Receiver Operating Characteristic curve (AUROC) of different methods in comparison to state-of-the-art models, which are described in the Online Methods.

Ranking candidate clinical trials
As shown in Table 2, TrialGPT outperforms all compared baselines, including dual-encoder, cross-encoder, and encoder-decoder models trained on different biomedical and clinical natural language inference (NLI) 30 datasets. The best baseline for ranking clinical trials is the cross-encoder BioLinkBERT 31 trained on MedNLI 32, which achieves an NDCG@10 of 0.5558 and a P@10 of 0.4663. The most effective TrialGPT features for ranking are the LLM-aggregated scores. They achieve an NDCG@10 of 0.7339 (by Relevance) and a P@10 of 0.5660 (by Eligibility), which are much higher than the other aggregations. Combining both the linear and LLM aggregations yields the highest ranking performance, with an NDCG@10 of 0.8165 and a P@10 of 0.7328.

Excluding ineligible clinical trials
Table 2 also shows the AUROC of excluding candidate trials, which is modeled as a binary classification task. The best baseline for excluding clinical trials is the dual-encoder SapBERT 33 trained on MNLI 34, SNLI 35, SciNLI 36, SciTail 37, MedNLI 32, and STSB 38, achieving an AUROC of 0.5842. This result shows only marginal improvement over the random score baseline, indicating that the task of excluding ineligible trials presents significant challenges. Unlike in the task of ranking clinical trials, the percentage of inclusion criteria predicted as "not included" and the percentage of exclusion criteria predicted as "excluded" also individually achieve comparable AUROC. Again, TrialGPT outperforms all baselines, and the combination of the features achieves an AUROC of 0.7749.

Discussion
For evaluation, we use three publicly available patient-trial matching datasets, where each patient is represented by a paragraph of free-text clinical summary. However, clinical trial matching sometimes requires recruiters to review patient information more comprehensively, involving longitudinal clinical notes, lab values, and even imaging data. This requires the model to (1) attend to much longer contexts, (2) process structured data, and (3) process multi-modal inputs. These aspects have not been evaluated in this study but are worth exploring in future work. We also note that the overall task formulation of the SIGIR and TREC datasets might be over-simplified, as many trial restrictions are not considered (geolocation, recruitment status, etc.), and the eligibility annotation is not strict in handling criteria with no relevant information.

Our work supports the position that AI models for clinical trial matching should not be designed to replace human recruiters but to empower them, and that experts should always be in the loop of medical AI deployments. Evaluation in real-life clinical trial matching scenarios should also focus more on efficiency gains for human recruiters, instead of solely reporting prediction performance. In this sense, the explanation capability of TrialGPT, and of LLMs more generally, is particularly helpful. This is exemplified by our pilot user study, which shows that TrialGPT can reduce screening time by 42.6% on average.

Online Methods

Datasets
SIGIR 2016: The candidate clinical trials for annotation are derived from pooling various retrieval methods. TREC 2021/2022 CT: The TREC 2021 and 2022 CT tracks contain 75 and 50 patients, respectively. These patient notes are synthetic patient case descriptions, "such as what may be included as part of an admission note". For each patient, three eligibility labels are annotated for the candidate clinical trials: irrelevant, excluded, and eligible. The candidate clinical trials are pooled from the submission systems of TREC participants.

TrialGPT
TrialGPT is an architecture for patient-trial matching with large language models. It is composed of three modules: 1) a backbone LLM, 2) a criterion-level prediction module, and 3) a trial-level aggregation module. Here we denote a patient note as a list of $S$ sentences $[s_1, s_2, \ldots, s_S]$, and a clinical trial as composed of the background information $B$ (containing the title, conditions, interventions, and the brief summary), a list of $M$ inclusion criteria $[i_1, i_2, \ldots, i_M]$, and a list of $N$ exclusion criteria $[e_1, e_2, \ldots, e_N]$.

Backbone LLM
TrialGPT is LLM-agnostic, meaning it can be plugged into different backbone LLMs. In this study, we use the GPT-4 API (model index: gpt-4, 0613 version, 8k context length) through Microsoft Azure's OpenAI services. We set the inference temperature to 0 for deterministic outputs.
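A representative setup for such deterministic GPT-4 calls, sketched with the current `openai` Python SDK for Azure; the endpoint, key, and SDK version are placeholders rather than details from the paper:

```python
from openai import AzureOpenAI  # openai>=1.x SDK; a representative setup

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-key>",                                       # placeholder
    api_version="2023-07-01-preview",
)

def chat(prompt: str) -> str:
    """One deterministic backbone-LLM call (temperature 0), as used by TrialGPT."""
    response = client.chat.completions.create(
        model="gpt-4",  # the Azure deployment name for gpt-4 (0613 version)
        temperature=0,  # deterministic outputs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```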

Criterion-level prediction
The objective of this module is to output a free-text relevance explanation, a list of relevant sentence IDs, and an eligibility prediction $y$ for each criterion based on the input patient note. For an inclusion criterion, $y \in$ {included, not included, not enough information, not applicable}, while for an exclusion criterion, $y \in$ {excluded, not excluded, not enough information, not applicable}.
We use different label sets for inclusion and exclusion criteria because the latter are often ambiguous. For example, the exclusion criteria "Pregnancy," "The patient should not be pregnant," and "Pregnant patients will be excluded" all serve the same purpose. Traditional entailment labels might not be suitable for distinguishing such semantic differences, while our eligibility-oriented label sets provide an end-to-end solution.
We make two LLM inference calls for each patient-trial pair: one for all inclusion criteria and another for all exclusion criteria. Overall, the prompt includes the task description, the clinical trial background information $B$, and the inclusion criteria $[i_1, i_2, \ldots, i_M]$ or exclusion criteria $[e_1, e_2, \ldots, e_N]$. Motivated by chain-of-thought prompting 44, we prompt the model to first generate the relevance explanation as grounding for the subsequent predictions of the relevant sentence IDs and the eligibility labels. In addition, we prompt the model to generate the criterion-level predictions in JSON format, which can be easily parsed for aggregation.
The TrialGPT prompts are shown in Supplementary Tables 2 and 3.
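A minimal sketch of one criterion-level inference call, assuming the `chat` helper above; the prompt skeleton paraphrases the structure described in this subsection, not the exact wording of Supplementary Tables 2 and 3:

```python
import json

CRITERION_PROMPT = """\
{task_description}

Trial background:
{background}

Criteria:
{criteria}

Patient note (sentences are numbered):
{numbered_note}

For each criterion, first explain the patient's relevance to it, then list the
IDs of relevant sentences, then give the eligibility label. Respond in JSON as
{{"<criterion id>": {{"explanation": ..., "sentence_ids": [...], "label": ...}}}}.
"""

def predict_criteria(chat, task_description, background, criteria, note_sentences):
    """Criterion-level prediction for one criterion type (inclusion OR exclusion).

    A sketch of the chain-of-thought-style prompting described above:
    explanation first, then sentence IDs and labels, returned as JSON.
    """
    prompt = CRITERION_PROMPT.format(
        task_description=task_description,
        background=background,
        criteria="\n".join(f"{k}. {c}" for k, c in enumerate(criteria)),
        numbered_note="\n".join(f"{k}. {s}" for k, s in enumerate(note_sentences)),
    )
    return json.loads(chat(prompt))  # {criterion id: explanation / sentences / label}
```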

Trial-level aggregation
After obtaining the criterion-level predictions, TrialGPT aggregates them into a trial-level score that can be used for practical applications such as ranking and excluding clinical trials. Specifically, we denote the eligibility predictions of TrialGPT for the inclusion and exclusion criteria as $[y(i_1), y(i_2), \ldots, y(i_M)]$ and $[y(e_1), y(e_2), \ldots, y(e_N)]$, respectively.
Linear aggregation: six scores are derived directly from the percentages of the different eligibility predictions. While more sophisticated scoring methods could be used, we intentionally adopt these simple, linear aggregation strategies to better probe the capabilities of LLMs.
For a trial's inclusion criteria, the scores are the fractions of criteria predicted as "included", "not included", and "not enough information"; the scores for the exclusion criteria are defined analogously over "excluded", "not excluded", and "not enough information".

Manual evaluation
Each annotator first labels the TrialGPT relevance explanation as "Correct", "Partially Correct", or "Incorrect". If at least two annotators provide the same label, it is used as the consensus. If the annotators choose three different labels for a patient-criterion pair, it is labeled as "Partially Correct". The annotators then annotate the relevant sentence locations for the criterion, and the consensus is the union of all annotator-provided sentences. Finally, each annotator provides an eligibility annotation, with the same candidate label set as that of TrialGPT. Similarly, if at least two annotators assign the same eligibility label, it is used as the consensus. If all three annotators assign different eligibility labels, a second round of discussion is scheduled until a consensus label is reached.
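The consensus rule for a single patient-criterion pair can be sketched as follows (the labels and tie-breaking value are taken from the description above):

```python
from collections import Counter

def consensus(labels, tie_value):
    """Consensus of three annotator labels, as described above (a sketch).

    If at least two annotators agree, that label wins; otherwise fall back to
    `tie_value` ("Partially Correct" for explanation quality; eligibility
    ties instead trigger another round of discussion in the study).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else tie_value

print(consensus(["Correct", "Correct", "Incorrect"], "Partially Correct"))   # Correct
print(consensus(["Correct", "Partially Correct", "Incorrect"],
                "Partially Correct"))                                        # tie
```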
Baselines
Cross-encoder models take both the patient note and the criterion as input, which enables cross-attention computations between the tokens of the two texts. The eligibility prediction is modeled as a 3-way classification task based on the special [CLS] embedding of BERT 46. We use label-space mapping functions that map each NLI label to an eligibility label, and the mapped labels are aggregated into a trial-level combination score:

$$\text{score} = \mathbb{1}(\%\ \text{unmet inclusion criteria} > 0) + \mathbb{1}(\%\ \text{met exclusion criteria} > 0) - \%\ \text{met inclusion criteria}$$

Encoder-decoder models also take both the patient note and the criterion as input to the encoder, but instead of outputting a classification prediction, they generate the predicted NLI labels, e.g., "entailment", "contradiction", or "neutral". These NLI labels are then mapped to eligibility labels, which are aggregated into combination scores by the same method described above for cross-encoder models.
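A sketch of the baseline pipeline under an assumed label mapping (entailment/contradiction/neutral to met/unmet/unknown); the paper's exact mapping functions are not reproduced here:

```python
# One plausible mapping from NLI labels to eligibility labels (an assumption
# for illustration; the paper's exact mapping functions may differ).
NLI_TO_INCLUSION = {
    "entailment": "included",
    "contradiction": "not included",
    "neutral": "not enough information",
}
NLI_TO_EXCLUSION = {
    "entailment": "excluded",
    "contradiction": "not excluded",
    "neutral": "not enough information",
}

def combination_score(inclusion_labels, exclusion_labels):
    """Trial-level combination score from mapped eligibility labels,
    following the formula above."""
    n_inc = max(len(inclusion_labels), 1)
    n_exc = max(len(exclusion_labels), 1)
    pct_met_inc = inclusion_labels.count("included") / n_inc
    pct_unmet_inc = inclusion_labels.count("not included") / n_inc
    pct_met_exc = exclusion_labels.count("excluded") / n_exc
    # Indicator terms: any unmet inclusion or met exclusion criterion adds 1.
    return int(pct_unmet_inc > 0) + int(pct_met_exc > 0) - pct_met_inc

# Example: one met and one unmet inclusion criterion, no met exclusions.
labels_inc = [NLI_TO_INCLUSION[l] for l in ["entailment", "contradiction"]]
labels_exc = [NLI_TO_EXCLUSION[l] for l in ["neutral"]]
print(combination_score(labels_inc, labels_exc))  # 1 + 0 - 0.5 = 0.5
```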

Evaluation settings
We report the NDCG@10 and P@10 for ranking candidate clinical trials, and AUROC for excluding ineligible clinical trials.
For computing NDCG@10 and P@10, we denote the ranked list of clinical trials as $[t_1, t_2, \ldots, t_K]$, where $K$ is the number of considered candidates. Their relevance scores are denoted as $[r_1, r_2, \ldots, r_K]$ and are converted following the settings of the TREC Clinical Trials tracks:

$$r_k = \begin{cases} 0, & \text{if } t_k \text{ is irrelevant} \\ 1, & \text{if } t_k \text{ is ineligible} \\ 2, & \text{if } t_k \text{ is eligible} \end{cases}$$

NDCG@$k$ is a measurement of ranking quality, computed by:

$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \quad \text{where } \text{DCG@}k = \sum_{j=1}^{k} \frac{r_j}{\log_2(j+1)}$$

and IDCG@$k$ is the DCG@$k$ of the ideal (relevance-sorted) ranking.

User study
In the user study, annotators judge whether a patient should be included in a trial ("Yes"), should not be included ("No"), or requires further investigation ("Maybe"). Two MD annotators (Q.J. and E.X.) recorded the time needed to make each decision after familiarizing themselves with the patient note. Each annotator screened half of the patient-trial pairs with TrialGPT and the other half without. We also ensured that each patient-trial pair was screened by one annotator with TrialGPT and by another without. The evaluation setting is visualized in Figure 4a.
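A minimal implementation of the NDCG@$k$ formula defined above, using the TREC-style graded relevances:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one patient, with graded relevances in ranked order
    (0 irrelevant, 1 ineligible, 2 eligible). A sketch of the formula above."""
    def dcg(rels):
        # 1-indexed rank j maps to log2(j + 1); enumerate is 0-indexed.
        return sum(r / math.log2(j + 2) for j, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))  # IDCG@k
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A toy ranked list: two eligible trials near the top, one buried at rank 11.
print(round(ndcg_at_k([2, 0, 2, 1, 0, 0, 0, 0, 0, 0, 2]), 3))
```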

Figure 1. The overall architecture of TrialGPT. (a) Given a patient note and a clinical trial, TrialGPT predicts criterion-level eligibility with natural language explanations and relevant sentence locations; (b) the criterion-level predictions are aggregated into trial-level scores; (c) the trial-level scores are used to rank and exclude candidate clinical trials.

Figure 2. Manual evaluations of criterion-level predictions by TrialGPT. (a) Correctness of the relevance explanations; (b) precision, recall, and F1 of the predicted relevant sentence locations; (c, d) confusion matrices of the criterion-level eligibility predictions for inclusion and exclusion criteria.

Figure 3. Correlation between differently aggregated TrialGPT scores and the ground-truth patient-trial eligibility labels.

Figure 4. Results of the patient-trial matching user study. (a) Experimental design.

Table 1. Baseline statistics of the three patient cohorts used in this work. We show the mean ± standard deviation for applicable variables.


Table 2. Performance of different methods for ranking and excluding clinical trials.
The Sign() function assigns the suitable sign for the corresponding task; e.g., for "% Included", the sign is "+" for ranking and "-" for excluding clinical trials.
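A sketch of how the Sign() convention might be applied when combining features into a single score; the feature names and grouping are our illustration, not the paper's exact feature set:

```python
def combined_score(features, task):
    """Combine linear and LLM-aggregated features with task-dependent signs.

    `features` maps feature names to values; features that indicate
    eligibility (e.g., "pct_included", the LLM relevance/eligibility scores)
    keep "+" for ranking, exclusion-indicating ones (e.g., "pct_excluded")
    are flipped, and all signs reverse for the excluding task. A hypothetical
    sketch of the Sign() convention described above.
    """
    positive_for_ranking = {"pct_included", "relevance", "eligibility"}
    task_sign = 1 if task == "ranking" else -1
    return sum((v if name in positive_for_ranking else -v) * task_sign
               for name, v in features.items())
```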
These experimental results show that TrialGPT can effectively rank candidate clinical trials and exclude ineligible ones, which could facilitate the trial-matching process.